Abstract




The design of a RISC processor requires a careful analysis
of the tradeoffs that can be made between hardware
complexity and software. As new generations of processors
are built to take advantage of more advanced technologies,
new and different tradeoffs must be considered. We examine
the design of a second generation VLSI RISC processor,
MIPS-X.


MIPS-X is the successor to the MIPS project at Stanford
University and, like MIPS, it is a single-chip 32-bit VLSI
processor that uses a simplified instruction set, pipelining and
a software code reorganizer. However, in the quest for higher
performance, MIPS-X uses a deeper pipeline, a much simpler
instruction set and achieves the goal of single cycle execution
using a 2-phase, 20 MHz clock. This has necessitated the
inclusion of an on-chip instruction cache and careful
consideration of the control of the machine. Many tradeoffs
were made during the design of MIPS-X and this paper
examines several key areas. They are: the organization of the
on-chip instruction cache, the coprocessor interface, branches
and the resulting branch delay, and exception handling. For
each issue we present the most promising alternatives
considered for MIPS-X and the approach finally selected.
Working parts have been received and this gives us a firm
basis upon which to evaluate the success of our design.


Introduction 


The first generation reduced instruction set processors
(IBM 801 [1], RISC [2, 3] and MIPS [4, 5]) have shown the
importance of making the correct tradeoffs across the
boundary that separates hardware complexity and software
functionality. Hardware should only be used to support
features that clearly improve performance. As
implementation technology improves, new features can be
considered and new tradeoffs must be made.


The goal of the MIPS-X project was to combine a new
technology, a 2 µm, 2-level metal CMOS process, with the
knowledge and experience gained from the first generation
RISC machines, to build a single processor with a peak rate of
20 MIPS and then to use 6-10 of these processors as the nodes
in a shared memory multiprocessor. The resulting machine
would be about two orders of magnitude more powerful than
a VAX 11/780 minicomputer.


We describe here the design of the single processor, MIPS-X.
The overriding principle was to keep the design as simple
as possible. The original MIPS team was heavily involved in
the initial architectural discussions, and they helped steer
MIPS-X away from the kinds of trouble that they faced with
MIPS. The major areas of concern were control related, of
which the most important were considered to be instruction
decode and exception handling. Neither was considered
early enough in the MIPS design, and both created difficult
implementation problems in the final chip.


The design of the instruction format was straightforward
since we religiously adhered to a maxim given in the first
working document on MIPS-X. It stated, "The goal of any
instruction format should be:

1.  Simple  decode, 

2.  simple  decode,  and 

3.  simple  decode. 

Any attempts at improved code density at the expense of CPU
performance should be ridiculed at every opportunity."
Needless to say, all instruction sets considered for MIPS-X
were fixed format 32-bit words and the amount of decoding
was minimal. The effects of having this simple instruction
format are discussed in the conclusions.


Not all areas were as stable as the instruction decode.
Before presenting the major tradeoffs we made in the MIPS-X
design, the next section describes the basic architecture of the
processor and the following section gives an overview of the
hardware and organization of the machine. This is followed
by several sections, each discussing a major design issue in
MIPS-X, the solution used and the rationale for that decision.


MIPS-X  Architecture 


The goal of the MIPS-X project was to design a
microprocessor with an order of magnitude more performance
than the original MIPS processor. MIPS-X borrows heavily
from the original MIPS design; it is again a heavily pipelined
machine, and the resulting pipeline interlocks are handled by
the supporting software system. MIPS-X differs from MIPS
in that it aims for single-cycle execution using a much faster
clock (20 MHz), a deeper pipeline and better implementation
technology.

The high instruction rate means that memory bandwidth is
an important consideration. At the projected clock frequency
of 20 MHz it is very difficult to satisfy instruction and data
fetch requirements across the available package pins. To
alleviate this problem, MIPS-X has a 2K-byte, on-chip
instruction cache (Icache). Only instructions that miss in the
Icache pass through the package pins. The Icache is placed
above the datapath, in the area of the chip that is normally
used for microcode storage and processor control. Data
references and instruction references that miss in the Icache
are handled by a large 64K-word external cache (Ecache).
The Ecache uses a shared bus to communicate with
memory. An added benefit of this two-level cache is that it
provides a second port to memory; the processor can fetch an
instruction from the Icache at the same time it is accessing
off-chip data.

A  deep  pipeline  is  used  to  allow  the  machine  to  start  a  new 
instruction  every  cycle.  Each  instruction  is  divided  into  five 
pipeline  stages.  They  are  described  in  Figure  1.  All  control  is 
hardwired. 


IF     Instruction fetch.
RF     Instruction decode and register fetch.
ALU    ALU or shift operation.
MEM    Wait for data from memory on a load and output
       data for a store.
WB     Write the result into the destination register.

Figure 1: MIPS-X Pipestages

The machine uses a load-store architecture; the only
memory operations are explicit loads and stores. The use of
the ALU cycle depends on the instruction being executed.
For compute instructions, this cycle performs the desired
computation; for memory instructions, it is used to compute
the address of the desired memory location; and for branch
instructions, it is used to compute the condition. All memory
operations use the same addressing mode; the contents of a
register are added to a 17-bit signed offset to produce a 32-bit
address. There are 32 general purpose registers in the
datapath with a 32-bit ALU and a funnel shifter for compute
operations.
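
The single addressing mode described above can be illustrated with a minimal sketch. The function names and the choice to model wrap-around at 32 bits are assumptions made for illustration; the text only states that a register value is added to a 17-bit signed offset to give a 32-bit address.

```python
MASK32 = 0xFFFFFFFF  # addresses are 32 bits wide


def sign_extend_17(offset17: int) -> int:
    """Interpret the low 17 bits as a two's-complement signed value."""
    offset17 &= 0x1FFFF
    # Bit 16 is the sign bit of a 17-bit field.
    return offset17 - (1 << 17) if offset17 & (1 << 16) else offset17


def effective_address(base: int, offset17: int) -> int:
    """Register contents plus sign-extended 17-bit offset (hypothetical helper)."""
    return (base + sign_extend_17(offset17)) & MASK32
```

For example, an offset field of all ones represents -1, so adding it to a base of 0x1000 gives 0x0FFF under this model.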

Although a compute instruction finishes its computation
during the third pipeline cycle (ALU), the result is not written
back into the register file until the last pipeline cycle. This
delayed writeback is done to make instructions only change
machine state during their last pipeline cycle, making
exception handling much easier. Bypassing is used to reduce
the number of pipeline interlocks.

All instructions are restartable so MIPS-X will support a
dynamic, paged virtual memory system. To help implement
such a system, MIPS-X supports both maskable and
nonmaskable interrupts. For systems requiring more complex
interrupt handling, an external interrupt coprocessor can be
added. MIPS-X also provides two operating modes, system
and user, that execute in separate address spaces to provide
the protection needed to implement an operating system. The
current mode is stored in the PSW and it can only be changed
while executing in system mode.


A  Hardware  Overview 

The major components of MIPS-X are the instruction
cache data array, the instruction register and the datapath.
The datapath is composed of the register file, the execution
unit, PC unit and the tag store for the instruction cache. The
organization of these parts is shown in Figure 2.


Figure 2: MIPS-X Floorplan


The instruction cache is organized as an 8-way set-
associative cache, with 4 sets (rows) and 16 words in each
block (line). A sub-block replacement scheme is used so
there are 512 valid bits, one per word, as well as the 32 tags.
These are located in the datapath to decrease the time needed
to detect an instruction cache miss.
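
The cache parameters above can be checked against each other with a short sketch. The constants come from the text; the address field layout in `split_word_address` (word-in-block in the low bits, then the set index, then the tag) is a hypothetical choice for illustration and is not stated in the paper.

```python
WAYS = 8               # associativity
SETS = 4               # sets (rows)
WORDS_PER_BLOCK = 16   # words in each block (line)
BYTES_PER_WORD = 4     # 32-bit words

blocks = WAYS * SETS                 # one tag per block
words = blocks * WORDS_PER_BLOCK     # one valid bit per word
size_bytes = words * BYTES_PER_WORD  # total data capacity


def split_word_address(word_addr: int):
    """Split a word address into (tag, set index, word-in-block).

    Assumed layout: low bits pick the word, the next bits pick the set.
    """
    word = word_addr % WORDS_PER_BLOCK
    set_index = (word_addr // WORDS_PER_BLOCK) % SETS
    tag = word_addr // (WORDS_PER_BLOCK * SETS)
    return tag, set_index, word
```

Under these numbers the cache holds 32 blocks and 512 words, which matches the 32 tags, the 512 valid bits and the 2K-byte capacity quoted in the text.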

The instruction register latches the output from the
instruction cache and predecodes some fields of each
instruction. It also controls the flow of data during cache
misses so that instructions can be written into the cache.
During a cache miss, the instruction is latched in the
instruction register from the data bus while it is going to the
cache memory array. This latch provides a very useful testing
feature by allowing the processor to run with the cache
disabled.

The register file contains 31 general purpose registers and
a hardwired constant zero register. It is useful to have a
read-only register as a place to write unwanted data. The
constant zero was chosen because it is used as a source value
for many instructions such as loading immediate values by
doing an add immediate to Register 0. Registers to handle
two levels of bypassing and the memory data registers are
also in this section.

Shifting and ALU operations are done in the execute unit.
It contains a 64-bit to 32-bit funnel shifter and a 32-bit ALU.
There is also a special register, called the MD register, that is
used during multiplication and division instructions.
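
A 64-bit to 32-bit funnel shifter can be modelled as selecting a 32-bit window from two concatenated 32-bit words. The sketch below shows the general technique only; the function names, and how shifts and rotates are mapped onto the shifter, are assumptions and are not taken from the MIPS-X instruction set.

```python
MASK32 = 0xFFFFFFFF


def funnel_shift(high: int, low: int, amount: int) -> int:
    """Return the 32-bit window of high:low that starts `amount` bits up."""
    if not 0 <= amount <= 32:
        raise ValueError("amount must be in the range 0..32")
    combined = ((high & MASK32) << 32) | (low & MASK32)
    return (combined >> amount) & MASK32


def rotate_right(x: int, n: int) -> int:
    # Feeding the same word to both inputs turns the funnel into a rotator.
    return funnel_shift(x, x, n % 32)


def logical_shift_right(x: int, n: int) -> int:
    # Zeros enter from the high input.
    return funnel_shift(0, x, n)


def logical_shift_left(x: int, n: int) -> int:
    # Zeros enter from the low input; the window is moved up by 32 - n.
    return funnel_shift(x, 0, 32 - n)
```

One attraction of this structure is that a single piece of hardware covers left shifts, right shifts and rotates by changing only what is fed to the two inputs.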


The program counter, or PC unit, contains a displacement
adder for branches, an incrementer and a chain of shift
registers to save the PC values of the instructions currently in
execution. Having both the displacement adder and the
incrementer means that as soon as the branch condition is
determined the PC bus can be driven with the correct value.
The PC values in the shift chain are needed to restart the
machine after an exception.

In a small area above each section of the datapath is local
instruction decoding and control for that section. The overall
control of the machine is handled by two finite state machines
located in the PC unit. One of them is used to handle Icache
misses and the other one does instruction squashing during
exceptions and branches. Squashing an instruction converts it
into a no-op instruction.

Critical  Paths 

To run the processor at or above 20 MHz meant that much
attention had to be paid to possible critical paths. In each
cycle, we tried to minimize the number of series operations as
much as possible. Whenever feasible, a signal was given a
full phase to be decoded and driven from one section to
another.


There were a few paths that we felt were most likely to be
critical paths and we spent a lot of time concentrating on
them. The most important of these involved external data
fetches. In the specification for the pipeline, addresses would
be computed during φ1 of the ALU cycle and driven to the
address pads during φ2. The Ecache would be accessed
during the MEM cycle. Even assuming that the address could
be driven off the chip by the end of ALU, completing a fetch
in 50 ns would be tight because of the address buffer delay,
memory access time and setup time for the fetched data.
Getting the result of the tag compare back in a cycle seemed
impossible since this would also involve delay through some
comparators. To ease the constraint on getting the tag
compare back, we decided to use a late-miss signal. This
meant that the cache would inform the processor during the
WB cycle whether the cache access during MEM was
successful. If there was a miss, then the processor
would effectively go back and re-execute φ2 of MEM to try
the access again. This loop would continue until the cache
got the data and signaled a hit. Throughout the design we had
to be careful not to unnecessarily add delay to the
memory-fetch path.

Other paths that we tried to optimize included the path
from branch condition generation to driving the PC bus,
instruction cache hit detection, squeezing the ALU time into 1
phase to get the address out by the end of the cycle and doing
register reads and writes in one cycle. The latter two were
strictly circuit design issues and are not discussed any further
here.


The Instruction Cache

Advances in processor architecture and VLSI technology
have progressed faster than the improvements in packaging
technology. This has meant that high-performance VLSI
processors have become memory bandwidth limited. For
example, if we assume that one instruction is fetched every
cycle while, on average, data is only fetched every third cycle,
then MIPS-X will have an average bandwidth of 26
MWords/s and a peak bandwidth of 40 MWords/s. Clearly,
on-chip memory would help to alleviate this bottleneck. For
MIPS-X, we built an on-chip 512-word instruction cache and
the tradeoffs made in its design are described in detail
elsewhere [6]. We will only discuss the salient features here.

The instruction cache was the first part of the chip to be
designed. We first fixed a die size that we felt had enough
area to implement the functionality we desired yet was small
enough that we could expect a reasonable yield of working
parts. The datapath and control would take about half of the
area inside the padframe so the cache was allocated the
remaining area, fixing its area and aspect ratio. The other
main constraint on the cache was that the cycle time had to be
less than the 50 ns clock cycle. Given these constraints we
investigated many different floorplans and organizations,
trying to minimize the average cost of an instruction fetch.
This cost is a function of the cache hit rate, the miss penalty,
and the cache access time.

We found that the performance of the cache was more
sensitive to the miss service time than the miss ratio. This
meant that the implementation details of the cache were more
important than the cache organization because the
implementation affected how quickly we could determine
whether an address hit in the cache. With our pipelining, this
meant the difference between stalling the machine for 2 or 3
cycles on a cache miss. By placing the tag and valid-bit
stores in the datapath close to the PC unit, a 2-cycle miss could
be realized. This lengthened the datapath by the number of
cache tags and meant that we could not have smaller block
sizes because more tags would make the datapath too long.
However, the benefits of having fewer cache miss cycles far
outweighed the slightly lower miss rates achievable by having
smaller blocks.

Initial simulations of this organization yielded
disappointing results. Using a set of medium size programs
we achieved miss rates that averaged over 20%. We felt that
real programs would have worse miss rates, pushing the cost
of an instruction fetch close to 1.5 cycles. We found a way to
reduce the number of cache miss cycles to 1 by writing the
missed instruction into the Icache as soon as it got back onto
the chip, but since accessing external data was already one of
the critical paths we did not want to risk extending the cycle
time to complete the write. Instead we realized that the 2
cache miss cycles could be used to fetch back 2 instructions,
the one that missed and the next one to be executed. Doing
this double fetch did not affect the critical path and, in fact,
was easier to do than fetching back only one instruction
because it minimized the disruption of the pipeline. Fetching
back 2 words almost halves the miss ratio, driving down the
cost of an instruction fetch to that of a single-cycle miss. The
key realization here was that there was extra cache bandwidth
available and that we could use it to fetch back the next
instruction, significantly improving the cache miss ratio
without impacting the cycle time of the machine. Fetching
back more words would not be advantageous because the
bandwidth of the cache is fully used.
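
The effect of the double fetch on the miss ratio can be seen in a deliberately simplified model. The sketch below assumes a cold cache with per-word valid bits, unlimited capacity and no block boundaries, none of which matches the real Icache exactly; `count_misses` and the traces are hypothetical.

```python
def count_misses(trace, words_per_miss):
    """Count misses when each miss brings in `words_per_miss` sequential words.

    Simplifying assumptions: unlimited capacity, no replacement, and the
    extra word is always fetched regardless of block boundaries.
    """
    valid = set()  # word addresses whose valid bit is set
    misses = 0
    for addr in trace:
        if addr not in valid:
            misses += 1
            for k in range(words_per_miss):
                valid.add(addr + k)
    return misses


straight_line = list(range(64))   # purely sequential code
with_branch = [0, 1, 2, 50, 51]   # a taken branch wastes one prefetched word
```

On the sequential trace the miss count drops from 64 to 32. On the trace with a taken branch it drops from 5 to 3, which illustrates why the text says the miss ratio is "almost" halved rather than halved.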

Trace driven simulations show that with our set of large
Pascal and Lisp benchmarks, the cache has an average miss
rate of 12% resulting in an average instruction executing in
1.24 cycles.
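
The figures quoted in this section are consistent with a simple cost model in which every fetch takes one cycle and every miss adds a fixed stall. This is one reading of the numbers, not a formula given in the paper; the function name and the default hit cost are assumptions.

```python
def avg_fetch_cycles(miss_rate: float, miss_penalty_cycles: float,
                     hit_cycles: float = 1.0) -> float:
    """Average cycles per instruction fetch under a fixed-penalty model."""
    return hit_cycles + miss_rate * miss_penalty_cycles


final_design = avg_fetch_cycles(0.12, 2)   # 12% miss rate, 2-cycle miss
first_attempt = avg_fetch_cycles(0.20, 2)  # 20% miss rate, 2-cycle miss
```

With a 12% miss rate and a 2-cycle miss the model gives 1.24 cycles, matching the stated result. With the earlier 20% miss rate it gives 1.4 cycles, so a somewhat worse miss rate on real programs would indeed push the cost toward 1.5 cycles.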


The  Coprocessor  Interface 

The coprocessor interface was considered from the very
beginning of the design. It also led to some of the most
interesting discussions within the MIPS-X design team. We
spent considerable time trying to find an efficient interface
that would give reasonable performance and still fit within the
constraints of VLSI packaging. This problem was
exacerbated by the presence of the on-chip instruction cache,
since now not all instructions would be visible to the outside
world.

The proposal for the first instruction set had a single bit in
every instruction to specify whether the instruction was for
the CPU or a coprocessor. For instructions with the
coprocessor bit set, MIPS-X would perform all the addressing
calculations, but would not affect any of its stored data. That
is, all coprocessor memory instructions still used the
processor to generate the addresses and the required control
signals, while the coprocessor either acted as a source or sink
of the data. To make the coprocessor instructions visible
outside of the processor, a dedicated bus was required to
transfer the instruction off the processor chip. This scheme
had 2 disadvantages: all interprocessor communication had to
go through memory, and a coprocessor bus was required. A
minor concern was that half the opcode space was devoted to
the coprocessor; there had to be a more efficient encoding.

The next instruction format divided the opcode space into
three instruction types: memory operations, branches and
compute operations. The memory and compute instructions
had a 3-bit field to specify the coprocessor number; branches
were only done on the main processor. If Coprocessor 0 was
specified then the instruction was for the main processor,
otherwise the instruction was for one of the 7 available
coprocessors. To branch on a coprocessor condition, the
coprocessor would first be told to assert a single input to the
main processor and a branch on coprocessor true or branch
on coprocessor false would be executed to test the status of
that input. Several coprocessors could be connected by
wire-ORing their outputs. This scheme still had the problem that
data transfers between processors must be done through
memory.

It was then proposed that all coprocessor instructions must
be non-cached, removing the need for a coprocessor bus. The
issue of pins and pin bandwidth was heavily debated within
the MIPS-X design team. Pins on the processor were in short
supply and devoting approximately 20 of them to the
coprocessor interface seemed excessive. The question was
not just whether there were enough pins available. Without
the coprocessor bus, MIPS-X would need only about 90
signal pins, a relatively small number by today's standards.
Rather the argument focused on what would be the best use of
these pins if we had them. It was not at all clear that using
them for the coprocessor interface was the most effective use
of the pins. To prevent coprocessor instructions from being
cached, a bit in the instruction cache would be set when an
instruction being loaded was detected to be a coprocessor
instruction. If the bit was set during an instruction fetch, the
fetch would be treated as a cache miss so that
the coprocessor would get the instruction off the
memory bus as the main processor read the instruction from
memory during the cache miss cycle.

The obvious disadvantage of this approach was that all
coprocessor operations incurred an overhead from the internal
cache miss. Our initial benchmarks indicated that this would
not cause a significant performance loss, but when we
generated traces from some floating point intensive code we
realized a significant percentage of the instructions were
floating point instructions. This caused a re-examination of
the decision to not cache coprocessor instructions, and led to
the coprocessor scheme that was finally chosen.

The opcode encoding of the machine was changed again,
this time making coprocessor operations a form of memory
operation or, more accurately, memory instructions became a
type of coprocessor instruction. Coprocessor instructions
work in this scheme by using the address lines to transmit the
coprocessor instruction. A memory instruction takes a 17-bit
offset constant and adds it to the contents of a register to
compute the memory address. If the memory system ignores
the cycle, it is possible to pass the 17-bit offset constant to a
coprocessor as an instruction. The instruction would include
a 3-bit field to specify the coprocessor being addressed,
although the processor does not need to know the format of
these instructions. This scheme has several advantages over
our earlier ideas. A coprocessor instruction bus is not
required, since the instructions are sent out over the address
pins. Only one extra pin is required to tell the memory
system to ignore the cycle. Additional pins can now be used
for alleviating the pin bandwidth problem in other parts of the
system. Using coprocessor load and store instructions, data
can be directly transferred between processors by making the
coprocessor supply or read data on the data bus instead of the
memory. Also, the coprocessor instructions can be cached
just like all the other instructions. The disadvantages of this
scheme are that there are fewer bits to specify the coprocessor
instructions, and all data to and from the coprocessor's
registers must be transferred through the main processor
registers first before it can be sent to memory.
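
To make the bit budget of this scheme concrete, the sketch below packs a coprocessor number and an operation into a 17-bit offset field. The placement of the 3-bit coprocessor field in the top bits and the names `encode_cop_instruction` and `decode_cop_instruction` are hypothetical; the paper only says that the 17-bit constant includes a 3-bit coprocessor field and that the main processor does not interpret the rest.

```python
OFFSET_BITS = 17                        # width of the offset constant
COP_NUM_BITS = 3                        # selects the coprocessor
OP_BITS = OFFSET_BITS - COP_NUM_BITS    # bits left for the coprocessor itself


def encode_cop_instruction(cop: int, op: int) -> int:
    """Pack (coprocessor number, operation) into a 17-bit field.

    Assumed layout: coprocessor number in the top 3 bits.
    """
    if not 0 <= cop < (1 << COP_NUM_BITS):
        raise ValueError("coprocessor number out of range")
    if not 0 <= op < (1 << OP_BITS):
        raise ValueError("operation does not fit in the remaining bits")
    return (cop << OP_BITS) | op


def decode_cop_instruction(offset17: int):
    """What a coprocessor watching the address lines would recover."""
    offset17 &= (1 << OFFSET_BITS) - 1
    return offset17 >> OP_BITS, offset17 & ((1 << OP_BITS) - 1)
```

Under this layout only 14 bits remain for the coprocessor operation, which is the "fewer bits" disadvantage noted in the text.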

Having to transfer all data through the main processor
registers was still thought to be inefficient for heavy floating
point computation. This led to a further modification of the
instruction set to add load floating and store floating
instructions. These instructions provide one special
coprocessor, which we assume will be a floating point unit
(FPU), with its own load and store instructions. The interface
now allows one special coprocessor to load and store its
registers directly to memory, without passing through the
main processor, in a single instruction. All other coprocessors
require one extra cycle for memory loads/stores.

One final tweaking of the interface was to remove the
coprocessor branch instructions. The main reason for their
removal was the problem of saving state in the coprocessors
across exceptions. The solution was to just read a
coprocessor status register into a main processor register and
then branch according to the value of that register. This
change eliminated the last set of problems we had discovered
with the coprocessor instructions.

By using the address lines, the resulting coprocessor
interface has instructions that can be cached, does not require
a large coprocessor bus, allows efficient communication
between the processor registers and the coprocessor registers,
and lets a single coprocessor have direct access to memory.


Branches 


Having set out the initial architecture of the machine, we
quickly ran into the problem of branches, and branch delays.
Branches have a considerable effect on the performance of a
computer, especially one that is pipelined as deeply as MIPS-X.
The effects of branches in a pipelined machine are
particularly noticeable because branches interrupt the flow of
the pipeline. Decisions about the design of the pipeline and
the type of branch scheme used are not independent. Control
complexity is a serious issue.

We very quickly decided to avoid the use of condition
codes in MIPS-X if possible. This decision was motivated by
two facts. First, instruction trace statistics indicated that a
prior compute operation infrequently generated the condition
code needed for a branch. In roughly 80% of the branches an
explicit compare operation must be performed to set the
condition codes. A previous analysis [7] of empirical data
showed that the number of instructions saved by condition
codes was very small, making them essentially useless. Second,
condition codes generate state that needs to be saved and
restored during exceptions. Handling condition codes in a
pipelined machine is difficult because when an exception
occurs, great care must be taken to ensure that the correct
condition codes are saved. It seemed to us that condition
codes provide little benefit and have potential complexity
problems. In particular, generating code to use condition
codes efficiently is not as straightforward as one might
expect. All the branch schemes considered for MIPS-X
contained an explicit compare in the branch. This actually
reduces the amount of control logic required because there is
no need to worry about how to save this state.

Two arithmetic operations are required to execute a branch
instruction. One is to compute the branch condition and the
other is to compute the branch destination. A machine that
uses condition codes computes the branch condition before
the actual branch instruction and saves the condition in a
condition code register. The first idea conceived for
implementing branches in MIPS-X computed the condition in
the branch instruction, but did not compute the branch
destination. Instead the branch destination was made
explicitly visible in the architecture. The user would have to
load a register called PC+1 with the branch destination. The
branch instruction computes a condition and then selects
PC+1 or the next sequential instruction depending on the
computed condition. An observation was made that many
inner loops contain several forward branches due to constructs
like if-then-else statements so it would be good to have
several PC+1 registers. Four was felt to be sufficient. This
would allow the compiler to hoist the destination address
calculations out of the loop. Without this feature, the contents
of PC+1 would have to be loaded from a register for each
branch within the loop for each iteration of the loop.

This scheme still had the problem that there was some state
that must be saved (the PC+1 registers) when an exception
occurred. Also, deciding how to use the PC+1 registers could
be cumbersome for the compiler system. Finally, with four
special registers, it was no longer clear that this solution was
easier to implement than simply including a separate adder to
compute the destination while the ALU performed the
comparison. At this point in the design, adding a little
hardware to the datapath to make the control simpler was the
wisest choice so we added the separate adder to compute the
destination.


During this period we also became concerned about the
effect of the branch delay slots on the machine's performance.
Often in a pipelined machine one or more instructions
following a branch are fetched before the result of the
condition evaluation is known. If these instructions are
executed, then the machine is said to have a delayed branch,
meaning the effect of the branch occurs after the actual branch
instruction. The number of cycles or delay slots that execute
after the branch instruction and before the actual branch
occurs is called the branch delay. Filling these delay slots is
not a simple task [8, 9, 10] and affects the overall performance.

In the MIPS-X pipeline, it is most straightforward to
implement a branch with a delay of two. The ALU is used to
compute the branch condition during the third (ALU)
pipestage. Filling two delay slots did not seem very
promising. Using data from MIPS instruction traces, we
expected over 50% of the slots to remain empty*. This
performance problem led to discussions about how to reduce
the branch delay to 1 cycle, and whether we could use branch
prediction to help reduce the wasted cycles [11, 12].

A quick compare was proposed as a method to reduce the
branch delay. In this scheme, simple comparisons between
the two source registers are done before the ALU cycle. This
comparison would be performed at the end of the RF cycle by
placing a comparator on the output of the register file. Only
equality and sign comparisons can be obtained using this
method since there is not enough time for an arithmetic
operation. Other conditions such as greater than would
require two steps. The ALU operation is done first and the
result is stored in a register. This result is then used in a quick
sign compare instruction.
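The split between conditions that need no arithmetic and conditions that need an ALU pass can be sketched as follows. This is a minimal model, not the proposed circuit: the function names are hypothetical, values are treated as 32-bit patterns, and the two-step greater-than assumes the subtraction does not overflow.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def quick_equal(a: int, b: int) -> bool:
    """Equality compare on two 32-bit source values; no arithmetic needed."""
    return (a & MASK32) == (b & MASK32)


def quick_negative(a: int) -> bool:
    """Sign compare: only the top bit of the register is examined."""
    return bool(a & SIGN_BIT)


def alu_subtract(a: int, b: int) -> int:
    """Full ALU operation with 32-bit wraparound; needs its own cycle."""
    return (a - b) & MASK32


def greater_than(a: int, b: int) -> bool:
    """Two-step 'a > b': subtract in the ALU, then quick sign compare.

    Assumption: b - a does not overflow, so the sign bit is meaningful.
    """
    difference = alu_subtract(b, a)    # step 1: ordinary ALU instruction
    return quick_negative(difference)  # step 2: quick sign compare branch
```

The second step is cheap, but the first occupies an instruction of its own, which is where the extra cycle for non-quick conditions comes from.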

The main question that needed to be resolved initially was
what percentage of branches could be handled by a quick
compare. Statistics from Katevenis's thesis indicate that by
changing the compiler slightly, about 80% of all branches can
be converted into quick compares7, but this means that 20%
of all branches take two cycles. Our initial statistics indicated
that the number of branches that could be handled using a
quick compare was between 20% and 80%.

The quick compare was eventually dropped because it
could potentially lengthen the processor cycle time. The
comparator circuit must operate on the source buses leading
to the ALU and since the values on the buses could come
from a bypass source it was possible that the buses would not
be stable until late into that cycle, particularly for a previous
memory fetch because the data would only be back at the very
end of the cycle. For the quick compare to operate, we would
need to perform a compare on these values and then use this
result to select the correct address of the next instruction. The
potential increase in cycle time discounted its slight advantage
in the average number of cycles it takes to complete a branch.
In retrospect, our decision was correct. In the final machine,
the delay from the generation of the branch signal to driving
the correct value on the PC Bus is long (measured to be about
20 ns). Even providing a full phase to drive this path leaves it
on a critical path.

Left with a branch delay of 2, we investigated branch
prediction as a way to reduce the effective branch delay.
There were two prediction algorithms tried: branch cache, and
static prediction. The branch cache was quickly discarded
when we discovered that it had to be fairly large (much
greater than 16 entries) to get a high hit rate. It would also
affect the size of our instruction cache. Besides, it never did
much better than static prediction and was much more
complex. Static prediction would use information at compile
time (possibly with profiling) to predict which way a branch
would go.

To make use of the prediction information we considered
implementing squashing, the ability to convert an instruction
into a no-op if the branch did not go in the predicted direction.
In MIPS, the instructions in the branch delay slots are always
executed. The strategy for choosing instructions is to first try
to move an instruction from before the branch into the slot. If
no instructions can be moved past the branch the next choice
is to find instructions from the destination or the sequential
path that have no effect if the branch goes the wrong way.
Thus if you predict correctly, the slot performs a useful
instruction and if the branch goes the other way, the slot
instruction is simply wasted. The last alternative is to place a
no-op instruction in the slot. Squashing relaxes the restriction
on the second choice for instructions. It allows any
instruction from the branch destination to be placed in the
slot, even when there is an adverse effect if the branch goes
the wrong way. The machine squashes the instruction (turns
it into a no-op) if the branch goes the wrong way.

With squashing there are three options for dealing with the
instructions in the delay slots, giving three possible branch
types: no squash, where the slot instructions are always
executed; squash if don't go, where the slot instructions are
executed if the branch takes; and squash if go, where the slot
instructions are executed if the branch does not take. Since
... branches go. MIPS-X only has the first two types of branches.
This requires only one bit in the instruction to specify how to
deal with the instructions in the slots.
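The three branch types reduce to a small decision function. The sketch below is illustrative only: the enum and function names are hypothetical, and it models only whether the delay-slot instructions take effect, not the pipeline itself.

```python
from enum import Enum


class BranchType(Enum):
    NO_SQUASH = "no squash"
    SQUASH_IF_DONT_GO = "squash if don't go"
    SQUASH_IF_GO = "squash if go"  # described in the text, not in MIPS-X


def slots_execute(branch_type: BranchType, taken: bool) -> bool:
    """Return True when the delay-slot instructions are allowed to execute."""
    if branch_type is BranchType.NO_SQUASH:
        return True       # slots always execute
    if branch_type is BranchType.SQUASH_IF_DONT_GO:
        return taken      # slots execute only when the branch takes
    return not taken      # squash if go: slots execute only on fall-through


# The two types that MIPS-X keeps differ only when the branch does not take,
# so a single instruction bit is enough to tell them apart.
for kind in (BranchType.NO_SQUASH, BranchType.SQUASH_IF_DONT_GO):
    print(kind.value, slots_execute(kind, taken=False))
```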

Various combinations of one and two-slot schemes with
and without squashing were evaluated. The results are shown
in Table 1. The no squash scheme is the same as used in
MIPS where the instructions in the slots are always executed.
The always squash scheme only uses the squash if go and
squash if don't go actions for the instructions in the branch
slots. The squash optional scheme includes the use of
branches with no squash instructions in the slots as well as
having branches with squashing. It can be seen that by
allowing squashing the efficiency of branches is much better.


Branch Scheme              Cycles/Branch2

2-slot no squash                2.0
2-slot always squash            1.3
2-slot squash optional          1.3
1-slot no squash                1.4
1-slot always squash            1.3
1-slot squash optional          1.1

Table 1: Average Cycles per Branch Instruction
for Various Branch Schemes


2 If all of the branch delay slots could be filled with useful
instructions, then we would achieve the ideal of a 1 cycle branch.
Any no-op instructions in the branch delay slots are counted in the


The scheme we finally chose uses the full compare and
squash optional with two slots. Our initial estimates about
the cost of the double slots turned out to be slightly optimistic.
Where we predicted the average branch would take 1.3
cycles, results using the actual reorganizer showed that the
average branch took about 1.3 cycles for small benchmarks
using traditional optimization. However, we have since
developed better optimization techniques and our most recent
results show that even with large Pascal and Lisp benchmarks
the average branch takes 1.37 cycles.

Implementing squashing was a gamble because we were
not completely sure how it would affect exception handling at
the time we made the commitment to use it. It turned out that
they mesh together very well as described in the next section.


Exception  Handling 

As the design of the machine progressed, our concentration
shifted from the functions the machine was going to perform
to how these functions were going to be controlled. MIPS-X
benefited greatly from the experience gained during the MIPS
design. Handling exceptions in MIPS caused the most
complexity in the machine because of the large number of
possible states in the processor during an exception. These
states were the result of the processor trying to complete the
instructions that occurred conceptually before the fault but
still in the pipeline, and reloading the partially full pipeline on
a return from an exception. The goal for MIPS-X was to
require as few states as possible to handle an exception so the
state machine design would not be difficult. The underlying
rule was to keep it simple, stupid13.

In some ways exception handling in MIPS-X followed the
MIPS model. Exceptions are not vectored so the exception
handler must first determine the cause of the exception. On
MIPS there was an on-chip surprise register where this
information was stored. MIPS-X relies instead on a separate
off-chip interrupt control unit that contains this information.
The PSW does contain bits that determine whether the
exception was caused by an interrupt, arithmetic overflow or a
non-maskable interrupt.

MIPS-X differed from MIPS in how exceptions affected
the pipeline. The MIPS exception sequence started with the
pipeline being flushed of as many instructions as possible that
were already executing. Then the program counter (PC) was
zeroed and the return PCs saved from the PC chain. The
flushing of the pipeline caused a great many extra states and
added a lot of complexity.

In MIPS-X the pipeline is halted when an exception
occurs. No instructions are completed. The PC is
immediately set to zero and the shift chain of old PC values is
frozen, saving the addresses of the instructions that are still in
the pipeline. The current PSW is placed in PSWold,
interrupts are turned off and the machine is placed into system
mode. The exception routine, located at address zero in
system space, begins execution by first saving the three PCs
from the PC chain and PSWold onto the system stack. Once
the state of the interrupted process is saved, then PC shifting
can be enabled and interrupts unmasked if desired. The
restart sequence involves reloading the PC chain with the
three saved PCs and then doing three jumps using the
contents of the PC chain; the PC chain is used to store the


Control 


return addresses during the return sequence. Interrupts must
be disabled both during machine state saving and restoring.

During the discussions about how branches were to be
implemented, there was some concern about the effects the
branch implementation would have on exception handling.
The original feeling was that having more branch slots would
require more state in the machine and implementing
squashing branches would make the state machine even more
complicated. The squash proponents argued that the
hardware needed to freeze the pipeline during an exception
could be used to implement squashing branches. They not
only convinced the design team, they also turned out to be
correct. Squashing two branch slots only requires a single
extra input to the squashing finite state machine that is used to
handle exceptions. Branch squashing and squashing for
exceptions are very similar.

The general scheme used to no-op an instruction is quite
simple. All that needs to be done is to set a bit in the
destination specifier for that instruction. This bit is used by
the register file to determine whether to perform a write or
not. There are 2 lines in the machine that can set this bit,
Exception and Squash. Exception no-ops the instructions in
the ALU and MEM stages of the pipeline, while Squash
no-ops the instructions currently in the IF and RF stages of
the pipeline. The only added complexity occurs with the
Mult/Div register and the PSW, which contain the only
visible state outside of the register file. Writes to these
locations are also prevented by Exception and Squash.
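A minimal sketch of this mechanism follows. It is a behavioral model under stated assumptions, not the MIPS-X logic: the class, the `no_write` field (standing in for the bit in the destination specifier), and the helper names are all hypothetical, and the first two pipeline stages are labeled IF and RF here.

```python
from dataclasses import dataclass

STAGES = ("IF", "RF", "ALU", "MEM")


@dataclass
class Instruction:
    name: str
    no_write: bool = False  # models the bit set in the destination specifier


def apply_kill_lines(pipeline: dict, exception: bool, squash: bool) -> None:
    """Set the no-write bit on the instructions covered by each line."""
    if exception:
        for stage in ("ALU", "MEM"):
            pipeline[stage].no_write = True
    if squash:
        for stage in ("IF", "RF"):
            pipeline[stage].no_write = True


def write_back(instr: Instruction, register_file: dict, dest: int, value: int) -> None:
    """The register file performs the write only if the bit is clear."""
    if not instr.no_write:
        register_file[dest] = value


pipeline = {stage: Instruction(stage.lower()) for stage in STAGES}
apply_kill_lines(pipeline, exception=False, squash=True)
print([stage for stage in STAGES if pipeline[stage].no_write])
```

Because both lines act on the same bit, an instruction that has been turned into a no-op still flows through the pipeline normally; it simply never changes visible state.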

There is only one exception generated on chip and it is a
trap on overflow in the ALU or the multiplication/division
hardware. At the start of the design it was felt that detecting
overflows and generating a trap was too complex to do. The
original solution was the concept of a sticky overflow bit. If
an overflow occurred then the sticky overflow bit would be
set in the PSW. This bit could then be checked at a later time
to determine whether an overflow had occurred. This meant
that it would not be possible to precisely detect the occurrence
of the overflow but at least it was possible to indicate the
presence of an incorrect result. We began looking for other
overflow mechanisms when we discovered that the sticky
overflow bit interacted badly with bypassing. Instead of
making the hardware simple, it seemed to make the PSW
harder to design.

Several other simple schemes were then proposed. One
was a SetOnAddOverflow instruction that just routed the
overflow bit from the ALU into the most significant bit of the
ALU result. This instruction could then be used to determine
whether the addition causes an overflow by simply testing for
the sign of the result. Another suggestion was a Branch on
Overflow instruction that caused a branch if the result of the
branch comparison overflowed. These were minimal
hardware solutions that would provide some small support for
overflow detection.
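The first of these proposals, routing the ALU overflow bit into the most significant bit of the result, can be illustrated with a short sketch. The function names are hypothetical and the model assumes 32-bit two's-complement operands passed as unsigned bit patterns; it shows the idea, not the proposed hardware.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def add_overflows(a: int, b: int) -> bool:
    """True when the 32-bit two's-complement sum a + b overflows."""
    a &= MASK32
    b &= MASK32
    total = (a + b) & MASK32
    # Overflow: the operands share a sign and the result's sign differs.
    return bool(~(a ^ b) & (a ^ total) & SIGN_BIT)


def route_overflow_to_msb(a: int, b: int) -> int:
    """Add, then replace the result's top bit with the overflow bit."""
    total = (a + b) & MASK32
    low_bits = total & (MASK32 ^ SIGN_BIT)
    return low_bits | (SIGN_BIT if add_overflows(a, b) else 0)


def is_negative(value: int) -> bool:
    """Sign test that software would apply to the routed result."""
    return bool(value & SIGN_BIT)


print(is_negative(route_overflow_to_msb(0x7FFFFFFF, 1)))  # overflowing add
print(is_negative(route_overflow_to_msb(1, 2)))           # ordinary add
```

A sign test on the routed result then answers the overflow question without any dedicated status bit, which is what made the scheme attractive as a minimal hardware solution.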

At this point the exception hardware had been designed
and we observed that generating a true trap on overflow was
not difficult; in fact it was simpler than the original sticky
overflow bit. We decided to abandon the sticky overflow bit
for a maskable trap on overflow.


Our overriding goal for the control section was to keep it
as simple as possible. In part we accomplished our goal by
eliminating hardware features that would complicate the
machine without providing significant performance
advantages. We also tried to keep a uniform view of the
hardware, trying to reuse the same control mechanism for
many features. Merging exceptions and squashing, and
merging memory instructions and coprocessor operations
were examples of this strategy. Finally, we eliminated the
global controller for the machine and replaced it with a set of
smaller controllers, one for each section of the datapath. We
further partitioned the design so that a single designer was
responsible for both the datapath and control in his section,
giving each designer the incentive to make his control section
simpler. Most of the machine control is simple decoders,
many generated automatically using PLA generators.

One technique that MIPS-X used to great advantage was a
qualified clock, called ψ1, to latch the control state of the
machine. This clock is the φ1 clock qualified with not
external cache miss and not internal cache miss. When either
cache misses, the ψ1 clock does not rise, and the control state
does not shift down the pipeline control latches. The lack of a
ψ1 clock causes the machine to execute the previous φ2 phase
before retrying the φ1 phase. This simple technique made
temporary stalling of the entire pipeline very easy, and
allowed us to implement the late miss described earlier
without greatly increasing the machine complexity. Since the
ψ1 clock is only allowed to clock control state latches, its
pulse width can be quite narrow (about 10 ns). As long as the
miss signal is monotonic, it is possible to detect a cache hit
after the data has been latched in the machine without stalling
the machine.
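The gating described above can be sketched in a few lines. This is a behavioral illustration only: the names `phi1`, `ecache_miss`, `icache_miss`, and `ControlPipeline` are assumptions, the latch depth is an arbitrary parameter, and real clock timing (pulse width, phase overlap) is not modeled.

```python
def qualified_clock(phi1: bool, ecache_miss: bool, icache_miss: bool) -> bool:
    """The qualified clock rises only when phase 1 is high and neither cache misses."""
    return phi1 and not ecache_miss and not icache_miss


class ControlPipeline:
    """Control-state latches that shift only on the qualified clock."""

    def __init__(self, depth: int = 5):
        self.latches = [None] * depth

    def clock(self, new_control, phi1: bool,
              ecache_miss: bool, icache_miss: bool) -> bool:
        """Shift control state one stage down; return False on a stall."""
        if qualified_clock(phi1, ecache_miss, icache_miss):
            self.latches = [new_control] + self.latches[:-1]
            return True
        return False  # latches hold their values, so the pipeline stalls


pipe = ControlPipeline(depth=3)
pipe.clock("add", True, False, False)  # normal cycle: state shifts
pipe.clock("load", True, True, False)  # external cache miss: state holds
pipe.clock("load", True, False, False) # retry succeeds once the miss clears
print(pipe.latches)
```

The point of the design is visible in the model: a stall needs no special state, because withholding one clock edge leaves every control latch exactly as it was.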

Together these control techniques were quite successful.
The control was nicely divided among the 4 main
sections, with the only true finite state machines (FSMs)
residing in the PC unit. These FSMs handle instruction cache
misses and instruction squashing during exceptions and
squashed branches. The state diagrams for the two machines
are shown in Figures 3 and 4. These FSMs are implemented
as simple shift registers with a very small amount of random
logic and occupy less than 0.2% of the total area of the chip.


Status  and  Conclusions 

The MIPS-X project began in earnest during the summer
of 1984. By January 1985, we had settled on an initial
version of the instruction set, and had written an instruction
level simulator for the machine. We were able to use much of
the software system that was created for MIPS for MIPS-X as
well. This greatly reduced the software development effort.
The compiler/simulator system generated instruction traces
that we used to gather cache statistics and fine tune the
architecture. By April 1985, the architecture had stabilized
and work on the detailed design accelerated. We ran our first
instruction through a detailed functional simulator of the
entire processor during the summer. The final design was
taped out at the end of April 1986 and we received first
silicon back in October.

The processor was designed to run at a clock rate of 20
MHz, executing an instruction every cycle, yielding a peak
performance of 20 MIPS. Timing analysis showed that the
version that was shipped in April would run at about 16 MHz.
Initial timing tests have shown that the part is fully functional

Simulations of our large Pascal benchmarks show that
15.6% of all instructions are no-ops due to unused branch
delays or other pipeline interlocks that cannot be optimized
away. For Lisp, this number increases slightly to 18.3% due
to a larger number of jumps and many load-load interlocks
caused by chasing car and cdr chains14. When the memory
system overhead is included (delays from Icache and Ecache
misses), the average instruction requires about 1.7 cycles
meaning MIPS-X should have a sustained throughput above
11 MIPS. Our benchmark programs have static code sizes in
the range of 50 KBytes to 270 KBytes so we cannot get exact
numbers for the effects of the external cache because most of
the benchmarks fit entirely. Smith's numbers15 are not large
enough so we used much larger traces16 to derive the Ecache
effects.
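The sustained-throughput figure follows directly from the clock rate and the average cycles per instruction quoted above. The short calculation below is only a worked check of that arithmetic; the function name is illustrative.

```python
def sustained_mips(clock_mhz: float, cycles_per_instruction: float) -> float:
    """Millions of instructions per second at a given clock rate and CPI."""
    return clock_mhz / cycles_per_instruction


# Peak rate: one instruction per cycle at the 20 MHz design target.
print(sustained_mips(20.0, 1.0))
# Sustained rate with memory system overhead included (about 1.7 cycles).
print(round(sustained_mips(20.0, 1.7), 2))
```

At 20 MHz and 1.7 cycles per instruction the result is a little under 12 MIPS, consistent with the "above 11 MIPS" estimate.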

The performance of a machine is based on three factors:
the number of instructions executed (path length), the number
of cycles per instruction and the cycle time. Ideally, all three
factors should be minimized but we have shown that by
having simple instruction decode we can significantly
decrease the latter two factors without adversely affecting the
path length. Comparison of Pascal programs with a VAX
11/780 shows that MIPS-X executes about 25% more
instructions but executes the programs about 14 times faster
for unoptimized code. The static code size for MIPS-X is also
about 25% greater than VAX code. The Stanford compiler
system was used and the only difference was in the back end
code generators. However, when MIPS-X code is compared
to the Berkeley Pascal compiler, the path length is 80% longer
and the speedup is only 10 times faster than the VAX. Much
of this difference may be due to poorer code from our VAX
code generator. We feel that when we get the results for
optimized code, the numbers will be somewhere in between.
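The three-factor view of performance can be turned into a quick calculation on the comparison figures quoted above. The sketch assumes execution time is the product of path length, cycles per instruction, and cycle time; the function names are illustrative, and the only inputs are the speedups and path-length ratios given in the text.

```python
def execution_time(path_length: float, cycles_per_instruction: float,
                   cycle_time_ns: float) -> float:
    """Execution time as the product of the three performance factors."""
    return path_length * cycles_per_instruction * cycle_time_ns


def per_instruction_advantage(speedup: float, path_length_ratio: float) -> float:
    """How much less time the average instruction takes on the faster machine.

    If a program runs `speedup` times faster while executing
    `path_length_ratio` times as many instructions, each instruction
    must on average complete speedup * path_length_ratio times faster.
    """
    return speedup * path_length_ratio


# Stanford code generators: 25% more instructions, about 14 times faster.
print(per_instruction_advantage(14, 1.25))
# Berkeley Pascal compiler on the VAX: 80% longer path, about 10 times faster.
print(round(per_instruction_advantage(10, 1.80), 2))
```

Both comparisons give a per-instruction advantage of roughly 17 to 18, which suggests the gap between the two speedups comes mostly from path length, in line with the remark about code quality from the VAX code generator.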

The  goal  of  the  MIPS-X  project  from  the  beginning  was  to 
learn  from  MIPS  and  design  a  simpler  yet  faster  processor. 
The  emphasis  in  all  design  decisions  throughout  the  project 
was  simplicity:  minimize  state  and  keep  the  control  simple. 
The  implementation  of  MIPS-X  has  shown  that  it  is  possible 
to  implement  a  high  performance  microprocessor  that 
supports  coprocessors,  without  requiring  complex  control  or 
hundreds  of  pins. 